Conversation
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: Zanie Blue <contact@zanie.dev>
…ate scoring (entrius#314) Co-authored-by: root <root@135-181-76-236.ptr> Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
… for missing issues (entrius#335)
Co-authored-by: anderdc <me@alexanderdc.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
just merged big changes into test please fix conflicts thanks |
… 502 errors on large PRs (entrius#331)
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
…res (entrius#346) Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
d9ff4cd to
d8a9df5
Compare
|
@anderdc I've rebased it on current |
…ount (entrius#340) Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
…s#351) Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: anderdc <me@alexanderdc.com>
d8a9df5 to
4ade5b6
Compare
|
I have a PR going into test now that will fix that pyright type check CI, don't worry about it |
There was a problem hiding this comment.
Review
Good problem identification — test and non-code files dragging down source scores via density is a real issue and the per-category approach is the right direction. The ScoringCategory enum and the category property on FileScoreResult are clean additions. The test suite is thorough and covers the right invariants.
However, the implementation adds more scaffolding than necessary and introduces a redundant iteration. Requesting changes on the following:
1. Don't loop twice over file results
from_file_results() re-iterates all file_results to categorize and re-sum totals that the existing loop in tree_sitter_scoring.py already computes. The category is trivially known at each append site (is_test_file + scoring_method), so just accumulate into per-category dicts inline during the existing loop. No second pass needed.
2. Extend PrScoringResult instead of adding PrScoringResultCategorized
The new wrapper class duplicates fields from PrScoringResult (total_score, total_nodes_scored, score_breakdown) and adds indirection. Since PrScoringResult is only constructed in one place and consumed in one place, just add by_category: Dict[ScoringCategory, PrScoringResult] and total_lines: int to the existing class. No need for a parallel class hierarchy or the _EMPTY_SCORING_RESULT sentinel.
3. Keep scoring logic in scoring.py
calculate_initial_base_score() and calculate_contribution_bonus() are business logic that reference scoring constants (MERGED_PR_BASE_SCORE, MAX_CONTRIBUTION_BONUS, CONTRIBUTION_SCORE_FOR_FULL_BONUS). This logic currently lives in scoring.py alongside the rest of the scoring decisions. Moving it into a dataclass in classes.py scatters scoring logic across two files. Keep it in scoring.py — the caller can iterate by_category directly with a few lines.
4. Threshold check may not isolate categories correctly
In calculate_initial_base_score(), the threshold uses self.score_breakdown.total_score — the aggregate across all categories including test and non-code. A PR with trivial source but a large test suite could pass the threshold via test score alone. The existing tests for this (test_tests_do_not_affect_threshold) pass token_score=0.0 explicitly on the factory, so they don't catch this path. The intent is that only SOURCE contributions matter for the threshold, the check should use the source category's score, not the aggregate.
5. Preserve diagnostic detail in log output
The current log shows density and bonus percentage:
Base score: 15.00 (density 0.50) + 3.0 bonus (10% of max 30) = 18.00
The PR reduces this to:
Base score: 15.00 + 3.0 bonus = 18.00
With per-category scoring, the density breakdown is actually more useful to log, not less. Please preserve or improve the diagnostic output.
8adcdf6 to
9c9860f
Compare
Closes #339
This PR addresses the issue that test code affect scoring process.
Before
After
Implementation details
ScoringCategoryenum. Possible values areTEST(if it's a test file),SOURCE(if scoring method istree_sitterand it's not a test file),NON_CODE(everything else: recognized non-code changes, unrecognized non-code changes, deleted files, binary files)PrScoringResultCategorized, which holdsPrScoringResultper each existing category.scoring.pyintoPrScoringResultCategorizedmethods.Test cases
Test files:
test_adding_tests_does_not_reduce_score- Adding test files to a source PR never lowers the base scoretest_tests_do_not_affect_contribution_bonus- Small and large test files produce modest, similar increases (test weight is 0.05x)test_same_code_in_test_path_scores_much_lower- Identical code in a test directory scores 10x+ lower than in a source pathtest_tests_do_not_affect_threshold- Test files can't push a below-threshold PR past the token score thresholdNon-code files:
test_adding_non_code_files_does_not_reduce_score- Adding non-code files (markdown, yaml) never lowers the base scoretest_non_code_does_not_affect_contribution_bonus- Small and large non-code files produce the same density increase (no bonus impact)test_source_code_scores_much_higher_than_non_code- Tree-diff scored source code scores 10x+ higher than line-count scored non-code filestest_non_code_does_not_affect_threshold- Non-code files can't push a below-threshold PR past the token score thresholdZero-score files:
test_deleted_file_does_not_change_score- Deleted files (score=0) don't affect the base scoretest_unsupported_file_does_not_change_score- Unsupported extensions (score=0) don't affect the base scoreDensity:
test_adding_test_category_increases_score_beyond_single_cap- Per-category density cap allows multiple categories to contribute independentlytest_verbose_formatting_decreases_score- Same logic in more lines produces lower density and lower scoretest_modified_file_scores_diff_only- Modified files score only the AST diff, not the entire fileThreshold:
test_below_threshold_scores_less- Trivial changes below token score threshold score less than substantial changes